Data Analysis using Pandas

Question 1 (100 points)

[Images: Wonder Woman and Captain Marvel]

Women are involved in the film industry in all roles, including as film directors, actresses, cinematographers, film producers, film critics, and other film industry professions, though women have been underrepresented in all these positions. Studies found that women have always had a presence in film acting, but have consistently been underrepresented, and on average significantly less well paid.

In 2015, Forbes reported that "...just 21 of the 100 top-grossing films of 2014 featured a female lead or co-lead, while only 28.1% of characters in 100 top-grossing films were female... This means it’s much rarer for women to get the sort of blockbuster role which would warrant the massive backend deals many male counterparts demand (Tom Cruise in Mission: Impossible or Robert Downey Jr. in Iron Man, for example)".

Also, Forbes' analysis of US acting salaries in 2013 determined that the "...men on Forbes’ list of top-paid actors for that year made 2½ times as much money as the top-paid actresses. That means that Hollywood's best-compensated actresses made just 40 cents for every dollar that the best-compensated men made."

We will examine whether and how women's representation is lacking in the film industry. To do this, we will adopt the Bechdel test as a measure of the representation of women in film. The test is named after the American cartoonist Alison Bechdel, in whose 1985 comic strip Dykes to Watch Out For the test first appeared.

We will collect the data ourselves to perform the analysis. Specifically, we will retrieve the movie metadata from IMDb (Internet Movie Database), an online database of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. As of January 2021, IMDb has approximately 6.5 million titles (including episodes) and 10.4 million personalities in its database, as well as 83 million registered users.

The IMDb Top 250 is a list of the 250 top-rated films, based on ratings by the registered users of the website. We will focus on these famous movies in this analysis.

Task 1

We will retrieve the metadata of the IMDb Top 250 movies from the IMDb charts. For each movie on the list, we can scrape a set of characteristics from its information page. For example, from the page of the top-rated movie "The Shawshank Redemption", we want to extract the metadata shown here:

[Image: metadata on the IMDb page for "The Shawshank Redemption"]

Web Scraping

Scraping the IMDb website for the respective attributes
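A minimal sketch of the first scraping step, assuming that the chart page links to each movie with a path of the form /title/tt<digits>/. The helper name extract_imdb_ids and the sample HTML are illustrative only and do not reproduce the actual page markup; in practice the HTML would be downloaded from the chart page first.

```python
import re

def extract_imdb_ids(html):
    """Return the unique IMDb ids (digits only) in order of first appearance."""
    seen = []
    for imdb_id in re.findall(r"/title/tt(\d+)/", html):
        if imdb_id not in seen:
            seen.append(imdb_id)
    return seen

# Hypothetical fragment standing in for the downloaded chart page.
sample_html = """
<a href="/title/tt0111161/">The Shawshank Redemption</a>
<a href="/title/tt0068646/">The Godfather</a>
<a href="/title/tt0111161/">second link to the same movie</a>
"""

ids = extract_imdb_ids(sample_html)
print(ids)
```

Keeping the ids as strings preserves their leading zeros, which matters later when they are passed to other services.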

Dataset Exploration

Question 1:

Group the movies by release year and show the number of movies in each decade, in descending order.
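One possible way to count movies per decade is sketched below. The column names Year and Decade and the sample rows are assumptions for illustration; the real dataframe would come from the scraped data.

```python
import pandas as pd

# Hypothetical sample rows; not the scraped data.
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Year": [1994, 1999, 1972, 2008, 1957],
})

# Integer-divide by 10 to drop the last digit, then scale back up.
movies["Decade"] = (movies["Year"] // 10) * 10

# value_counts sorts by count in descending order by default.
movies_per_decade = movies["Decade"].value_counts()
print(movies_per_decade)
```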

Question 1.3 (5 points) Show the number of movies in each runtime quartile (0-25%, 25-50%, 50-75%, 75-100%).
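A short sketch of quartile binning with pd.qcut, assuming the runtime has already been cleaned into a numeric number of minutes. The sample values are made up.

```python
import pandas as pd

# Hypothetical runtimes in minutes.
runtimes = pd.Series([96, 102, 110, 118, 125, 133, 142, 175], name="Runtime")

labels = ["0-25%", "25-50%", "50-75%", "75-100%"]

# qcut splits on sample quantiles, so each bin holds roughly a quarter of the rows.
runtime_quartile = pd.qcut(runtimes, q=4, labels=labels)

# sort_index orders the result by quartile rather than by count.
counts = runtime_quartile.value_counts().sort_index()
print(counts)
```

Because the bins are defined by quantiles, the counts are close to equal by construction; ties at a bin edge are the main reason they can differ.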

Question 1.4 (5 points) What is the proportion of movies that have a Budget higher than that of 75% of all movies (i.e. above the third quartile)?
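The proportion can be sketched as the mean of a boolean mask, assuming Budget is numeric. The values below are placeholders, and missing budgets are dropped first so they do not count in the denominator.

```python
import pandas as pd

# Hypothetical budgets (in millions), including one missing value.
budget = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None], dtype="float64").dropna()

# Third quartile of the budget distribution.
q3 = budget.quantile(0.75)

# The mean of a boolean Series is the share of True values.
proportion_above_q3 = (budget > q3).mean()
print(q3, proportion_above_q3)
```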

Question 1.5 (5 points) Show the top 10 most popular actors/actresses by the number of movies they have starred in.
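A sketch of one approach, assuming the cast is stored as a comma-separated string in a column called Stars (a hypothetical name). Each name is moved to its own row before counting.

```python
import pandas as pd

# Hypothetical sample; names and column layout are assumptions.
movies = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Stars": ["Actor X, Actor Y", "Actor X, Actor Z", "Actor Y, Actor X"],
})

# Split into lists, put one name per row, and trim stray whitespace.
stars = movies["Stars"].str.split(",").explode().str.strip()

# Count appearances and keep the ten most frequent names.
top_stars = stars.value_counts().head(10)
print(top_stars)
```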

Question 1.6 (5 points) Show the top 5 directors with the highest total box office revenue.
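A minimal sketch, assuming one director per movie and a numeric revenue column. The column names Director and Box Office and the figures are illustrative.

```python
import pandas as pd

# Hypothetical sample rows.
movies = pd.DataFrame({
    "Director": ["D1", "D2", "D1", "D3"],
    "Box Office": [100.0, 250.0, 200.0, 50.0],
})

# Sum revenue per director, then keep the five largest totals.
top_directors = (
    movies.groupby("Director")["Box Office"]
    .sum()
    .sort_values(ascending=False)
    .head(5)
)
print(top_directors)
```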

Question 1.7 (5 points) Show the average ratings of movies across the genres and decades.
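A pivot table gives one natural layout for this question, with genres as rows and decades as columns. The sketch assumes one genre per row; if a movie lists several genres, they would first need to be split into separate rows. Column names and values are made up.

```python
import pandas as pd

# Hypothetical sample rows.
movies = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Crime", "Crime", "Drama"],
    "Decade": [1990, 1990, 1970, 1990, 1970],
    "Rating": [9.3, 8.8, 9.2, 8.9, 8.7],
})

# Average rating for each genre/decade pair; empty pairs become NaN.
avg_rating = movies.pivot_table(
    index="Genre", columns="Decade", values="Rating", aggfunc="mean"
)
print(avg_rating)
```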

Question 1.8 (5 points) Create a new column ROI that measures the return on investment as (box office revenue - budget) / budget, and compare the ROI of English-language movies with that of non-English movies. Use a t-test to examine whether the difference is statistically significant. (You can use scipy.stats.ttest_ind to test the difference in means of two distributions.)
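The calculation could look roughly like the sketch below. The column names Language, Budget and Box Office and all figures are placeholders, and equal_var=False (Welch's t-test) is a choice made here, not something the question prescribes.

```python
import pandas as pd
from scipy import stats

# Hypothetical sample rows (monetary values in millions).
movies = pd.DataFrame({
    "Language": ["English", "English", "English", "Japanese", "French", "Italian"],
    "Budget": [25.0, 6.0, 63.0, 2.0, 10.0, 1.2],
    "Box Office": [28.0, 250.0, 101.0, 0.3, 174.0, 25.0],
})

# Return on investment: (revenue - budget) / budget.
movies["ROI"] = (movies["Box Office"] - movies["Budget"]) / movies["Budget"]

# Split into the two groups, dropping rows where ROI could not be computed.
is_english = movies["Language"] == "English"
english_roi = movies.loc[is_english, "ROI"].dropna()
other_roi = movies.loc[~is_english, "ROI"].dropna()

# Welch's t-test does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(english_roi, other_roi, equal_var=False)
print(t_stat, p_value)
```

A p-value above the chosen significance level (commonly 0.05) means the data do not give enough evidence to reject equal mean ROI.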

Given the very high p-value, we fail to reject the null hypothesis of equal means: the difference in ROI between English and non-English movies is not statistically significant.

Question 1.9 (5 points) Do commercially successful movies also receive higher ratings? Check the correlation between box office revenue and ratings using the Pearson and Spearman correlations.
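Both coefficients are available in scipy.stats. The sketch below assumes numeric Rating and Box Office columns (hypothetical names) and uses made-up values.

```python
import pandas as pd
from scipy import stats

# Hypothetical sample rows.
movies = pd.DataFrame({
    "Rating": [8.1, 8.3, 8.5, 8.8, 9.0, 9.3],
    "Box Office": [50.0, 20.0, 300.0, 120.0, 1000.0, 30.0],
})

# Both tests need complete pairs, so drop rows missing either value.
clean = movies.dropna(subset=["Rating", "Box Office"])

# Pearson measures linear association; Spearman measures rank (monotonic) association.
pearson_r, pearson_p = stats.pearsonr(clean["Rating"], clean["Box Office"])
spearman_rho, spearman_p = stats.spearmanr(clean["Rating"], clean["Box Office"])
print(pearson_r, pearson_p)
print(spearman_rho, spearman_p)
```

Box office revenue is typically heavily skewed, so the rank-based Spearman coefficient is less sensitive to a few very large values than Pearson.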

Question 1.10 (10 points) Now let's retrieve data from Bechdel Test Movie website for each movie. You can send the requests to the API: https://bechdeltest.com/api/v1/doc#getMovieByImdbId. For example, for the movie The Shawshank Redemption (the IMDb id: 0111161), you can simply call: http://bechdeltest.com/api/v1/getMovieByImdbId?imdbid=0111161.

Create a dataframe bechdel_imdb_top that merges the Bechdel test info with imdb_top_movies, and show how many of the top 250 movies are also on the Bechdel Test website.
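A sketch of the request URL and the merge. It assumes each API response is a JSON object with imdbid and rating fields, and that imdb_top_movies holds the id in a column named imdb_id (a hypothetical name). The records below are placeholders with made-up ids, not real API output, and no network call is made.

```python
import pandas as pd

API_URL = "http://bechdeltest.com/api/v1/getMovieByImdbId"

def bechdel_url(imdb_id):
    """Build the request URL for one IMDb id (digits only, kept as a string)."""
    return f"{API_URL}?imdbid={imdb_id}"

# With network access, each record could be fetched with the third-party
# requests package, e.g. requests.get(bechdel_url(imdb_id)).json().
# Placeholder records standing in for the responses (assumed field names).
records = [
    {"imdbid": "0000001", "rating": 3},
    {"imdbid": "0000002", "rating": 1},
]
bechdel = pd.DataFrame(records)

# Placeholder for the scraped top-250 dataframe.
imdb_top_movies = pd.DataFrame({
    "imdb_id": ["0000001", "0000002", "0000003"],
    "Title": ["Movie A", "Movie B", "Movie C"],
})

# A left merge keeps every top movie; indicator=True records whether a match was found.
bechdel_imdb_top = imdb_top_movies.merge(
    bechdel, how="left", left_on="imdb_id", right_on="imdbid", indicator=True
)
n_matched = (bechdel_imdb_top["_merge"] == "both").sum()
print(n_matched, "of", len(imdb_top_movies), "movies found")
```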

Add data from Bechdel Test Movie website

I have kept the imdb_top_movies.csv file as originally scraped from the website, so the changes made in the analysis above are not saved to it. The columns created above, such as ROI (%), and the columns that were cleaned are needed for the next analysis, so the code above must be run first to obtain the most up-to-date, cleaned version of the imdb_top_movies dataframe.

Question 1.11 (5 points) Show the percentage of movies at each test rating, from 0 to 3 (0 means fewer than two women, 1 means the women do not talk to each other, 2 means they talk only about a man, 3 means the movie passes the test).
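The percentages can be sketched with value_counts(normalize=True). The rating column name follows the assumed API field, and the values are made up.

```python
import pandas as pd

# Hypothetical Bechdel ratings for eight movies.
ratings = pd.Series([3, 3, 1, 0, 3, 2, 1, 3], name="rating")

# normalize=True returns shares instead of counts; sort_index orders them 0..3.
pct = ratings.value_counts(normalize=True).sort_index() * 100
print(pct)
```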

Question 1.12 (5 points) Show, for each genre, the percentage of movies at each test rating, from 0 to 3 (0 means fewer than two women, 1 means the women do not talk to each other, 2 means they talk only about a man, 3 means the movie passes the test).
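A cross-tabulation normalized by row is one way to lay this out. The sketch assumes one genre per row and uses placeholder data and column names.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Drama", "Drama", "Action", "Action"],
    "rating": [3, 3, 1, 0, 1, 3],
})

# normalize="index" makes each genre's row sum to 1; multiply to get percentages.
pct_by_genre = pd.crosstab(df["Genre"], df["rating"], normalize="index") * 100
print(pct_by_genre)
```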

Question 1.13 (5 points) Show the top 10 highest-rated movies that passed the test completely (rating=3)
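A sketch of the filter-then-rank step. To avoid confusing the two scores, the IMDb score is held in a column named IMDb Rating here and the Bechdel score in rating; both names and all values are assumptions.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "IMDb Rating": [9.0, 8.6, 8.9, 8.4],
    "rating": [3, 3, 1, 3],
})

# Keep movies that fully pass the test, then take the ten highest IMDb ratings.
passed = df[df["rating"] == 3]
top_passed = passed.nlargest(10, "IMDb Rating")[["Title", "IMDb Rating"]]
print(top_passed)
```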

Question 1.14 (5 points) Comparing the movies that passed the test (rating=3) with those that failed it (rating=0), is their ROI different? Explain.
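One way to approach the comparison is a per-group summary followed by a two-sample test. The sketch assumes ROI and rating columns already exist, uses invented values, and again picks Welch's t-test as an assumption.

```python
import pandas as pd
from scipy import stats

# Hypothetical sample rows; the rating=1 row is excluded by the filter below.
df = pd.DataFrame({
    "rating": [3, 3, 3, 0, 0, 0, 1],
    "ROI": [1.0, 2.0, 3.0, 0.5, 1.5, 2.5, 9.0],
})

# Keep only the two groups named in the question.
subset = df[df["rating"].isin([0, 3])]

# Descriptive comparison of ROI for each group.
summary = subset.groupby("rating")["ROI"].agg(["count", "mean", "median"])
print(summary)

# Welch's t-test on the two groups' ROI values.
passed_roi = subset.loc[subset["rating"] == 3, "ROI"]
failed_roi = subset.loc[subset["rating"] == 0, "ROI"]
t_stat, p_value = stats.ttest_ind(passed_roi, failed_roi, equal_var=False)
print(t_stat, p_value)
```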

Question 1.15 (10 points) Now load bechdel_imdb.json, which contains all movies rated by the Bechdel Test website. Has women's representation improved over the decades? Create a dataframe bechdel_imdb and compare the top 250 movies with the other movies: what percentage of each group passed or failed the test?
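The layout of bechdel_imdb.json is not shown here, so the sketch below assumes a list of records with imdbid, year and rating fields and parses an inline string in place of the file. It also treats rating=3 as "passed" and everything else as "failed", which is an assumption. All ids and values are placeholders.

```python
import json
import pandas as pd

# Inline stand-in for the file; with the real file one could use
# json.load(open("bechdel_imdb.json")) instead.
json_text = """[
  {"imdbid": "0000001", "year": 1994, "rating": 3},
  {"imdbid": "0000002", "year": 1997, "rating": 1},
  {"imdbid": "0000003", "year": 2003, "rating": 0},
  {"imdbid": "0000004", "year": 2005, "rating": 3},
  {"imdbid": "0000005", "year": 2008, "rating": 3},
  {"imdbid": "0000006", "year": 1991, "rating": 0},
  {"imdbid": "0000007", "year": 2012, "rating": 3}
]"""
bechdel_imdb = pd.DataFrame(json.loads(json_text))

# Hypothetical set of ids belonging to the top 250 list.
top_ids = {"0000001", "0000003"}

bechdel_imdb["decade"] = (bechdel_imdb["year"] // 10) * 10
bechdel_imdb["passed"] = bechdel_imdb["rating"] == 3
bechdel_imdb["group"] = bechdel_imdb["imdbid"].isin(top_ids).map(
    {True: "Top 250", False: "Other"}
)

# The mean of a boolean column is the pass rate; scale to percent.
pass_by_decade = bechdel_imdb.groupby("decade")["passed"].mean() * 100
pass_by_group = bechdel_imdb.groupby("group")["passed"].mean() * 100
print(pass_by_decade)
print(pass_by_group)
```

Parsing with json.loads rather than pd.read_json keeps the ids as strings, so leading zeros are not lost.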

As the table above shows, female representation has increased over the decades. The largest jump is between the 1990s and the 2000s, when the share of movies with fair representation of women more than doubled at the turn of the century.